[pydap backend] enables downloading/processing multiple arrays within single http request #10629
base: main
Conversation
hmm - the test I see that fails (sporadically) concerns the following assertion, where the groups have reverse ordering in the way dimensions show up.
Thanks @Mikejmnez !
xarray/backends/pydap_.py
Outdated

```python
timeout=None,
verify=None,
user_charset=None,
batch=False,
```
Would it make sense to have the default be batch=None, which means "use batching if possible"? This would expose these benefits to more users.
I am not sure I fully understand what you mean. Do you mean batch = None|dict, where in the dict a user specifies which variables to download together? Or do you mean batch if dap4?

batch = True|False is intended, at the moment, as a way to download (stream) data faster and to make workflows scalable (when having to aggregate 100s of urls on the client side) by downloading multiple variables at once (single url):

```
ds = xr.open_mfdataset(urls, engine='pydap', ...., batch=True)
ds.(<define_slice_here>).to_zarr  # .to_netcdf or whatever...
```

and so per dataset, you get roughly a single dap url with all variables.

NOTE: I did make the change to batch = None as default, and I am up for setting batch = None | dict to enable broader usage in the future. pydap could easily support the dict aspect. For now it is *all* available variables or None.
batch = None|dict

I see the benefit of setting batch = None|dict to specify which variables to download together. But with opendap urls, you can already specify a filter to reduce, from the original source file, which variables to access. For example:

new_url = base_url + "?dap4.ce=/var1;/var2;/....;/VarN"

where N <= M, the number of variables in the original remote file. (Note this is very different from xarray.Dataset.drop_variables, since xarray first parses all M variables and then discards the M-N variables --> not very useful when M~O(1000) and N~O(1).)
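The URL filter described above can be sketched as a small helper. This is hypothetical illustration code (`constrain_url` is not part of pydap or xarray); it only shows the `?dap4.ce=` syntax by which the server, rather than the client, restricts which variables are returned:

```python
# Hypothetical helper illustrating the DAP4 constraint-expression syntax:
# the server filters the M source variables down to the N requested ones
# before the client ever parses anything.
def constrain_url(base_url, variables):
    """Append a dap4.ce filter listing only the wanted variables."""
    ce = ";".join(f"/{v}" for v in variables)
    return f"{base_url}?dap4.ce={ce}"

print(constrain_url("http://example.org/data.nc", ["var1", "var2"]))
# http://example.org/data.nc?dap4.ce=/var1;/var2
```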
batch if dap4 (if possible)

This is a bit tricky. Some servers are configured to provide a single opendap url for an aggregated view of the entire dataset (an .ncml). This holds for both the dap2 and dap4 protocols. For opendap servers in the cloud, this is not used (not sure if it is possible). And so batch=True makes most sense for the non-aggregated views of the dataset.

I think the danger would be using batch=True on an aggregated view of the dataset, as it would attempt to download all of it in a single request.
This is a little concerning! Not sure how this could be a bug on the Xarray side, unless we're using the wrong API for getting variable dimensions from Pydap.

I'm seeing the same error over here. Not quite sure what to make of this, but it seems to be a separate bug.

Thanks @shoyer ! I am participating all week in a hackathon, but I will try to check and address your comments as fast as I can :)
```diff
 def get_dimensions(self):
-    return Frozen(self.ds.dimensions)
+    return Frozen(sorted(self.ds.dimensions))
```
To potentially address the issues with dimensions in Datatree, and the lat/lon dimensions being inconsistently ordered, I added this sorted call to the dimensions list that the backend gets from the Pydap dataset directly. Hopefully this little fix will make the issue go away, but I will continue checking it locally after merging main into this PR (it has not failed once yet! knocks on wood).
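The Python behavior this one-line fix relies on can be shown in isolation (the dimension maps below are hypothetical stand-ins for what the backend reads from the Pydap dataset): iterating a mapping yields insertion order, which can differ between groups, while sorted() always yields the names in the same order.

```python
# Hypothetical dimension-name -> size maps with different insertion orders,
# standing in for dimensions read from two Pydap groups.
dims_a = {"lat": 180, "lon": 360}
dims_b = {"lon": 360, "lat": 180}

print(list(dims_a), list(dims_b))        # iteration orders differ
print(sorted(dims_a) == sorted(dims_b))  # True: sorted() is deterministic
```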
These are only dataset-level dimensions, not variable-level dimensions. At the dataset level, dimension order doesn't really matter, so I doubt this is going to fix the issue, unfortunately.
Force-pushed from b3c77a0 to aaa07c4
@shoyer I had a second go at this finally. Moved much of the logic to the backend. Here is the current state of things:
Force-pushed from 1687221 to 20fb5cd
@shoyer This is ready for further reviewing. Pydap has a new release that fixes some issues in the backend xml parser (there was a bug that got fixed). I think there may be some additional work needed in the next couple of weeks, but that is unrelated to this PR anyway...
Force-pushed from 9c15100 to 6c45f50
… all together in single dap url
…ed at once (per group)
…stall after new release if no further change to backend
Force-pushed from aac3163 to 4b516b4
@shoyer Let me know if there is any feedback, concerns, further reviewing, etc. This PR enables a new (non-default) feature that was added to the pydap backend over the span of several months, namely the ability to download multiple variables within a single request, as the opendap spec allows. Without this feature, each variable is downloaded separately, which does not take advantage of the opendap protocol and can make pydap unusable when each remote file has more than ~2-3 variables and there are at least 10 urls to consolidate (for example via mds = xr.open_mfdataset and then mds.to_zarr or something). This PR also makes it so that when accessing via the dap4 protocol, all dimensions are always downloaded within a single request by default. This is the most performant approach compared to downloading each dimension with a separate request, and again improves performance when "only opening" multiple remote files.
whats-new.rst
With this PR, dimensions are always batched (downloaded) together in the same request in DAP4.

In addition to this, and to preserve backwards compatibility, I added a backend argument batch=True | False. When batch=True, it is possible to download all non-dimension arrays in the same response (ideal when streaming data to store locally). When batch=False, which is the default, each non-dimension array is downloaded with its own http request, as before; this is ideal in many data-exploration scenarios, and the last step (ds.load()) is what triggers the individual downloads. These changes allow a more performant download experience with xarray+pydap.
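A minimal sketch of the two modes just described. The helper and URL are hypothetical (only `engine="pydap"` and the `batch` kwarg come from this PR), and the open call is shown commented out since it needs a live DAP4 server:

```python
# Hypothetical helper so the batching choice is inspectable without a
# live server; `batch` is the backend kwarg this PR adds to pydap.
def pydap_open_kwargs(batch=False):
    # batch=False (default): each non-dimension array is fetched with its
    # own http request when ds.load() runs.
    # batch=True: all non-dimension arrays arrive in a single response.
    return {"engine": "pydap", "batch": batch}

# Usage sketch (requires a reachable DAP4 endpoint):
#   import xarray as xr
#   ds = xr.open_dataset("dap4://example.org/file.nc",
#                        **pydap_open_kwargs(batch=True))
#   ds.load()  # one request for all non-dimension arrays

print(pydap_open_kwargs())  # {'engine': 'pydap', 'batch': False}
```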
However, most of these changes depend on a yet-to-be-released version of pydap (pydap 3.5.6). I want to check that things go smoothly here before making a new release, i.e. perhaps I will need to make a change to the backend base code.

3.5.6 has been released!